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THE TOPOGRAPHY OF MULTIVARIATE NORMAL MIXTURES^ 

By Surajit Ray and Bruce G. Lindsay 

University of North Carolina at Chapel Hill 
and Pennsylvania State University 

Multivariate normal mixtures provide a flexible method of fitting 
high-dimensional data. It is shown that their topography, in the sense 
of their key features as a density, can be analyzed rigorously in lower 
dimensions by use of a ridgeline manifold that contains all critical 
points, as well as the ridges of the density. A plot of the elevations on 
the ridgeline shows the key features of the mixed density. In addition, 
by use of the ridgeline, we uncover a function that determines the 
number of modes of the mixed density when there are two components 
being mixed. A foUowup analysis then gives a curvature function that 
can be used to prove a set of modality theorems. 



1. Introduction. 

1.1. The topography of a density. Fitting a mixture model offers a pri- 
mary data reduction through the number, location and shape of its compo- 
nents. However, in more complex settings we would like to know more about 
how the components interact to describe an overall pattern of density. What, 
for example, is the modal structure, or in a richer sense, the configuration 
of major features? The goal of this paper is to develop new insights into the 
topography of multivariate normal mixture densities, with the special aim 
of providing tools that are useful even in high-dimensional data. 

Description of a multimodal density is challenging even in one dimen- 
sion. For unimodal models, the density shape might be described through 
concepts like skewness and kurtosis. When the density is multimodal, the 
emphasis is usually shifted to the number and location of modes, since 
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the modes are the dominant feature, and are themselves potentially symp- 
tomatic of underlying population structures. A common approach to mul- 
timodal data structures is to use a mixture model because it provides a 
decomposition of the sampled population into a set of homogeneous compo- 
nents in a way that is consistent with the multimodal density configuration. 

However, the set of modes is not in one-to-one correspondence with the 
distinct components. For example, in a univariate normal mixture model, 
two components can be similar enough that their mixture creates a single 
mode. If the population is well described by such a two component but- 
unimodal density, the analyst can construct two competing hypotheses. One 
is that the population has two homogeneous groups with normal shape, and 
the other is that there is just one group and it has a unimodal but nonnormal 
shape. Knowledge of these competing hypotheses could then lead to further 
scientific investigation. 

Our purpose here is to describe the complex density shapes that can arise 
in a multivariate data set. To do so, we will appeal to the language and 
imagery of the earth's topography. Suppose we wish to describe the main 
features of a contour plot of a bivariate density f{x,y). We equate this to 
the problem of describing the surface features of a land mass, where the 
elevation at a point {x,y) is equated with the bivariate density f{x,y). The 
local maxima of the density are the peaks, and their location, together with 
elevation, provides a first-order description of topography. But in a richer 
sense, mountains are usually aggregated into mountain ranges, in which the 
neighboring peaks are connected through ridges. The perceived separation 
of two neighboring peaks is then determined by the elevation at the lowest 
point on this ridge, the saddlepoint between them. 

Here we will show how to create such a description for a mixture of high- 
dimensional normals or similar distributions, like multivariate-t. Our results 
are closely related to ideas in Morse theory, which is the mathematical study 
of the topology of surfaces based upon their differential structure [15, 16]. 
We will point out these relationships along the way. 

1.2. Relevant literature. The literature on determination of the number 
of modes in normal mixture models has focused primarily on univariate mix- 
tures. In fact, there is a simple description of modality when one is mixing 
two univariate components, de Helguero [5] determined necessary and suffi- 
cient conditions for bimodality in the mixture of two univariate normals with 
equal variances and mixing proportions. Later, conditions for bimodality in 
the mixture of univariate normal distributions with unequal variance and 
unequal mixing proportions were studied by Eisenberger [6], Behboodian [1] 
and Robertson and Fryer [20]. Kakiuchi [10] and Kemperman [11] addressed 
conditions for bimodality using nonnormal component densities. 
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Moreover, for the mixture of univariate normals, Robertson and Fryer [20] 
developed explicit formulae that one can use to determine if there are one 
or two modes, and further showed that there can be no more than two. This 
work, which can be shown to be a corollary of one of our results (Theorem 3), 
shows that modality is determined not only by the separation between the 
normal components, but also by the mixing proportion vr, with values near 
and 1 tending to suppress extra modes. 

Some approaches to understand the modal behavior of multivariate nor- 
mal mixtures can be found in recent machine learning literature (see [3]), 
but their results are limited to all components having the same covariance 
structure. 

1.3. Our results. At this stage let us develop our notation. We will de- 
note the dimension of the multivariate density by D and the number of 
components of the mixture by K. All our topographical methods work for 
arbitrary D, but, for dimensionality reasons, our primary focus will be on 
cases where the number of components is = 2 or 3. However, we also ob- 
tain a number of results for arbitrary K. As long as K — 1 is less than D, 
we can provide a dimension reduction to the problem. 

In particular, if = 2, a tremendous dimension reduction is possible, 
from D (arbitrary) down to one. In Section 2 we show that the problem 
of finding the modes for K components can be reduced to examining the 
density values on a {K — 1) -dimensional manifold of 5?^, which we call 
the ridgeline surface. Surprisingly, the ridgeline surface does not depend 
on the mixing proportion vr. We then design graphical and analytical tools 
to describe the density on this manifold. One of our key results is that 
this surface has a ridge-like property that enables one to determine global 
features from those that are local to the ridgeline surface. 

In Section 3 we demonstrate plots of the density on the ridgeline, which we 
call elevation plots. One of the surprising results in our investigation is that, 
in a mixture of two multivariate normals, the statement "a mixture of two 
normals cannot have more than two modes" is false, unlike the univariate 
problem. 

Next, in Section 4, we go into a deeper analysis of the modal structure. 
For K = 2 we, have constructed a function n(-) that does not depend on 
vr, but whose plot can be used to determine the number and location of 
modes for each value of vr. In Section 5 we develop a geometric analysis 
which shows that there exists a fundamental curvature function k(-) whose 
zeroes determine the modality potential of a pair of component densities. 
For D = 1 this can be used to prove the Robertson and Fryer [20] results. 
For larger D this result can be used to prove modality results for certain 
important special cases, such as equal or proportional covariance matrices. 
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Finally, Section 6 provides interpretation of our results, possible ways 
of generalizing these results and lists some unanswered questions on the 
topography of multivariate normal mixtures. 

2. The ridgeline manifold. A X-component mixture of D-dimensional 
normals can be represented by the probability density function 

5(x) = 7ri(/>(x; /x^, Ei) + ■K24'{^; fi2, ^2) H h t^kH^; /x^, Si^-), 

xG K^, 

where iTj is the mixing proportion of component j, ttj G [0, 1], J2f=i'^j = 1) 
and (?i>(x; /i, S) is the density of a multivariate normal distribution with mean 
fj, and variance S. We will sometimes use </'j(x) as shorthand notation for 
and call <j)j the jth component density. 
Our goal in this section is to show that, for any D-dimensional, K- 
component normal mixture, we can define a {K — 1) -dimensional surface, 
which is guaranteed to include all the critical points (modes, antimodes 
and saddlepoints) of the D-dimensional mixture density. We will then show 
that plotting the density values in this {K — 1) -dimensional space is com- 
pletely informative about the main topographic features of the density in 
D-dimensional space. 

2.1. The {K — 1)- dimensional ridgeline manifold. To find this manifold, 
we examine the structure of the critical points of g (i.e., all values of x 
where the first derivative of g is equal to 0). First we need to introduce 
some terminology: 

Definition 1. The {K — 1) -dimensional set of points 
(2) 



= |« G 3f?^ : a, G [0, 1], 5^ a, = l| 



will be called the unit simplex. The function x*(q;) from Sk into defined 

by 

x*(a) = [aiS^^ + QsS^^ + • • • + ai^S^^]"^ 

(3) 

X [qiS^ /X^-F 02112 ^^2^ ^OIK^j^Hk] 

will be called the ridgeline function. It will sometimes be written as x* . The 
image of this map will be denoted by M and called the ridgeline surface or 
manifold, li K = 2, it will be called the ridgeline., as it is a one-dimensional 
curve. 
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For K = 2 the ridgeline can be represented as 

(4) x*(a) = [a5]f^ + aS^^]"^[aS^Vi + aS2 V2], 

where a £ [0, 1] and a = 1 — a. As a varies from to 1, the image of the 
function x*(a) defines a curve from /x^ to fj,2- 

Remark 1. Note that the ridgeline function x*(q!) and, hence, the man- 
ifold A4, depend on the means and variances of the component densities, 
but not the mixing proportions ttj. 

We next show that all the modes and saddlepoints of the full density ^(x) 
must occur in M. We will later show how their exact location depends on 
the values of the vr^'s. 

Theorem 1. Let g{x.) be the density of a K -component multivariate 
normal density as given by (1). Then all of g{x)^s critical values and, hence, 
modes, antimodes and saddlepoints, are points in M . 

Proof. See the Appendix. □ 

The ridgeline surface has a simple structure if the component variances 
Sj are equal. The following result can be found in [3]. 

Corollary 1. // Sj = S, then the convex hull of the means fXj of the 
component densities contains all critical points of the density g. 

Proof. The manifold A4 equals the convex hull. □ 

To illustrate the ridgeline curve, we consider the following simple example: 

Example 1 (Two components, three modes, unequal variance). The 
mixture density with D = 2 and K = 2, and the parameters 

'^i = (o)' = 0.05)' 

/1\ „ /0.05 0\ 1 

Figure 1 shows the contours of the density given in Example 1. Overlaid 
on the contour plot is the ridgeline curve, showing how it passes through 
the three modes and saddlepoints of the density g. 
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Remark 2. Morse theory is used in the analysis of terrains [4] and 
watersheds [17], albeit with varying terminology. One can define a "critical 
net" or "watershed" as a map that describes the terrain through the location 
of the critical points in its elevation function, together with one-dimensional 
curves called "separatrices" that connect these critical points. A flow line 
represents the path taken by water under gravity, and a separatrix is defined 
to be a flow line that connects two critical points. The flow line from a local 
maximum to a saddlepoint therefore creates a division of the terrain into 
two "catchment basins," as flow lines do not cross. Mathematically, a flow 
line is a line of steepest descent (or ascent when reversed), so that the flow 
line is always moving in the direction of the gradient. It is therefore always 
moving orthogonally to the elevation contours. For K = 2 our ridgeline curve 
has properties similar to the separatrices, as we shall show, but is not one 
itself. In particular, the separatrices for a mixture of two normals depend 
on the value of vr, but the ridgeline curve does not. 

2.2. The ridgeline elevation plot. The next step in our analysis is to 
consider the diagnostic properties of the elevation plot, which is a plot of 
the ridgeline elevation function defined by 

h{cx)=g{-x*{cx)). 

We start by considering the case where K = 2, so a is one-dimensional. 
Figure 2(a) shows the elevation plot of the distribution of Example 1. One 




T 1 1 1 r 

-S -I 1 2 



Fig. 1. Contour plot and ridgeline curve ( ) for the mixture density given m Exam- 
ple 1. 
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might hope that the number and location of the modes and antimodes of the 
elevation plot might indicate to us the number and location of the modes 
and saddlepoints of the original density. We will make this rigorous through 
further analysis in the next subsection. 

Remark 3. The elevation plot carries an inherent visual distortion rel- 
ative to the original density plot. This distortion arises because the distance 
between two a values, say a\ and a^^ may not accurately reflect the distances 
between x*(ai) and x*(a2) as measured along the ridgeline. For example, 
the saddlepoints in the contour plot of Figure 1 have a distance from the 
endpoints that is not well represented on the a-scale in Figure 2(a). For 
K = 2 this distortion can be corrected by replacing the a scale with L{a), 
the arclength of the ridgeline path x*(a) from to a [see Figure 2(b)]. 

Remark 4. Theorem 1 can be easily generalized to other families of 
multivariate distributions. If a density characterized by parameters {fx, S) 
depends on x only as a decreasing function of the Mahalonobis distance 
(x — /^)'S~^(x — /x), then the ridgeline manifold is exactly the same as given 
here. An example of this type is the multivariate t distribution (see [13, 18]). 

2.2.1. Ridgeline-like properties. We now verify that the modes and an- 
timodes in the ridgeline elevation plots, such as in Figure 2, necessarily 
correspond to modes and saddlepoints in the original density. Recall that 
in terrain analysis [4] a separatrix is the line connecting the highest points 
along a ridge and separating drainage basins from one another. We have 



(a) 




n 1 1 r r r 

a.o nis B.4 aji M IX) 



(b) 




T [ 1 r 



O.O 0.5 1.0 l.S 2X) 

L{a) 



Fig. 2. Ridgeline elevation for the bivariate normal mixture of Example 1 along the 
ridgeline path x*(a), expressed as a function of (a) parameter a and (b) the arclength 
L{a). Three local maxima representing the three modes of the density are visible near 
a = 0, 0.5 and 1 in (a), which corresponds to L{a) =0, 1 and 2 in (b). 
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already indicated that our ridgelines are not true separatrices. However, 
confirmation that the ridgehne elevation plot is fully informative about the 
main features of g arrives by showing that the ridgeline surface has local 
properties similar to a separatrix. This section is devoted to making this 
idea mathematically precise. 

If a person walks from peak A to peak B along a separatrix path, then 
this passage would be characterized by the fact that to the left and right, 
perpendicularly, the ridge falls away into two drainage basins. In other words, 
one stands at a local maximum in elevation of the mountain cross-section 
perpendicular to the path. Because of this, any time one reaches a point 
along the path that is locally maximal in elevation relative to the other 
points on the path, it must also be a local maximum to the entire nearby 
surface. By the same logic, when one reaches a local minimum along the 
path, it must be a saddlepoint to the surface. In this way one can infer the 
nature of the critical points on the whole surface from just the elevations 
along the path. 

The motivation for our analysis arises from the following geometric prop- 
erty of the ridgeline curve x*(a) when K = 2. Consider any contour {x: 
(/!)i(x) = c} of the first component density. This forms an ellipse. Provided 
that ^2 is not inside this ellipse, one can create a contour of the second 
component, say {x: ^2(x) = d}, such that the two ellipses intersect at a sin- 
gle point xq. One can show that xq is necessarily a point on the ridgeline 
curve, and, in fact, all points on the ridgeline curve are the "kissing points" 
of two such ellipses. Now, consider any point x that is on the hyperplane 
separating the two ellipses, but is not xq. Since it is in neither ellipse, it 
must have a smaller density value under (f)i and (j)2 than xq does. But this 
means that (7(x) < g(xo), regardless of the value of vr. That is, xq is a lo- 
cal maximum relative to all points in the hyperplane. In fact, it must be a 
maximum relative to all points not in the two ellipses, independently of the 
value of vr. 

It is also clear (from its above geometric description) that the local direc- 
tion of the ridgeline path never lies in the hyperplane. Hence, if the derivative 
of the elevation along this path becomes zero at a point, there is a full rank 
set of directions in which the directional derivatives are zero and, hence, the 
gradient is zero at this point, making it a critical point. 

To make this argument precise and to extend it to the multivariate case, 
we need to develop some notation, li K = 2, so that a is a scalar, the 
derivative vector d(a) = dx{a)/da of the ridgeline points in the direction of 
the curve's travel, li K > 2, then ct is a vector, and the K — 1 derivative 
vectors of x(q) with respect to qi , . . . , ax-i, written di , . . . , dx-i, represent 
a basis for the space of possible directions of travel within the ridgeline 
surface. Using (3), these can be found as 

di = 5'~^(vj-Vi^), 
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where for fixed a we have defined the matrix Sa = J2^j^j vectors 
Vj = SJ^(x*(q!) — fij). We next define a Hnear subspace of vectors that are 
orthogonal to the surface's direction vectors dj in an appropriate sense, 

W = {w:w'5c.dj =0 Vj = 1}. 



Theorem 2. // v^^ gW, then along the path {:x.{a) + 6w. 6 e ^} the 
function g{x) takes its maximum value at 6 = 0. 

Proof. See the Appendix. □ 

Corollary 2. Every critical point a of h{a.) corresponds to a critical 
point x(q:) of g{x). A critical point a of h{a) gives a local maximum of 
g{x.) if and only if it is a local maximum of h{cx). If D > K — 1, so that 
the h{a) plot is a true dimension reduction, then g(x) has no local minima, 
only saddlepoints and local maxima. In general, for D > K — 1, at a critical 
point of h(a) whose second derivative matrix has m negative eigenvalues the 
function (?(x) will have a critical point whose second derivative matrix has 
an additional D — K + 1 negative eigenvalues corresponding to the dimension 
of the orthogonal directions w. 

Proof. The directional vectors, together with their orthogonal comple- 
ment W, span the space, and we know that the W vectors are all directions 
of local maximization. □ 

3. Some illustrative examples. Before we proceed further with the the- 
ory, we present some examples to illustrate the theory and methods devel- 
oped to this point. We also use them to motivate the next set of theoretical 
developments. 

3.1. Elevation plots for K = 2. Let us return to Figure 1. It appears 
from the contour shapes that there are three modes, all lying on the ridge- 
line. This shows that the multivariate normal case has a very different, and 
more complex, modal structure than the univariate. (Carreira-Perpihan and 
Williams [3] also give an example in which there are three modes.) From 
Figure 2 we can see clearly that there are indeed three modes. (Note: the 
modes that appear to be at the endpoints actually occur a slight distance 
from the ends, although it is not visible in this resolution.) We also can 
see that the central mode is dominant, and the minima (corresponding to 
saddlepoints in Figure 1) are relatively shallow. Note that the contour plot 
in Figure 1 is not available unless D = 2, but the elevation plot works for 
any D. 
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This example above opens up a natural question regarding the existence 
of an upper bound to the number of modes in a two component mixture of 
multivariate normals. We have constructed another example where we can 
find four modes in three dimensions. 

Example 2 (Two components, four modes, unequal variance). For K = 
2 and D = 3, let the parameters be 



Figure 3 is the ridgeline elevation plot of the mixture in Example 2. It has 
four modes, although they are not clearly visible from the plot in Figure 3(a). 
We can more clearly see them in Figures 3(b) and 3(c), which show the 
ridgeline elevation plot for narrower ranges of values. Determination of the 
exact location in a of the modes can be done numerically using any nonlinear 
optimization software (we used the nlm function in R). In our case we found 
that the four modes are located at a = 0.00084,0.137,0.863 and 0.99916. 

These examples raise the question: can we create further mathematical 
tools that will guide us in the determination of the number and location 
of modes? The answer is yes, but before tackling this we consider elevation 
plots in a higher dimension. 

3.2. Elevation plots for K = 3. For K = 3 the dimension of a and, 
hence, the ridgeline surface, is two, so we suggest using a contour plot of 
h{a) in the a coordinate system to look for critical points of the density. 

To distinguish between the contour plot of an original density (possi- 
ble only when D <2) and its contour described along the ridgeline surface 
(available for arbitrary D), we will denote the former as the density contour 
and the latter as the ridgeline contour. 

A distance-preserving way to represent the simplex ^3 in is as an equi- 
lateral triangle. The vertices of the triangle correspond to the three equidis- 
tant points ei = (1,0,0), 62 = (0,1,0) and 63 = (0,0,1) of the simplex. Each 
point in the triangle corresponds to a point (ai, 02, 03), where aj equals the 
length of the perpendicular dropped to the side opposite to the vertex ej. 
We will use the symbol aj at the corner because the distance from the 
opposing baseline gives the aj value. At the corner itself, aj equals one. 

We plot the ridgeline contours on this triangle. We have shown that the 
number of peaks on the contour plot is exactly the number of modes of 
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Fig. 3. Ridgehne elevation plot of density of Example 2. (a) For the whole range of a, 
(b) magnified for a € (0,0.0002) and (c) magnified for a £ (0.1,0.16). 

the 3-component normal. We can also find exact positions of the modes 
by determining the simplicial coordinates a, then computing x*(a); these 
locations will depend on the values of vr that are used in g. 

Example 3 (Three components, three modes, equal variance). For K = 
3, D = 2, let the covariance matrix be common and the parameters be 

/Xl=(o), f^2={l), /^3=(o)' 

^=[0 ij' vri=^2 = ^3=3- 

Figure 4 shows the modality surface contour plot of Example 3. The three 
peaks along the three corners of the triangle are easily visible. In fact, these 
three modes lie very close to the corners, which implies that the three modes 
of the density g are close to the three means of the component densities. Of 
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Fig. 4. Ridgeline contour plot of the density of Example 3. 

course, for this example one could have done a contour plot of g itself because 
D = 2. 

Now we move on to an unequal variance example. 



Example 4 (Three components, five modes, unequal variance). For K ■ 
3, D = 2, let the parameters be 

»), - 

1 



'0.05 
1 



If we carefully look at the ridgeline contour plot of Figure 5(a), we can find 
five modes, which can be verified from its density contour plot in Figure 5(b). 
The five modes are (i) near 03 = 1, (ii) near ai = 1, (iii) near the centroid 
(one with height of the contour being 0.23), (iv) near ai = 02 = 0.5 and 
(v) near 02 = 03 = 0.5. In this example, although Figures 5(a) and 5(b) 
carry the same modal information, we can see rather dramatically that the 
ridgeline contour plot distorts the distances and angles between the modes 
relative to the density plot. 

Example 5 (Iris data). Next, we consider a three component mixture 
model fit to Fisher's iris data [7] . This is the dataset made famous by Fisher, 
who used it to illustrate principles of discriminant analysis. Data on four 
variables, namely. Petal width, Petal length, Sepal width and Sepal length, 
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Fig. 5. (a) Ridgeline contour plot and (b) density contour plot of the three component 
mixture density of Example 4 with five modes. 



were collected on flowers of three iris species: Setosa, Verginica and Versi- 
color. Each species had 50 observations. 

Since Z) = 4, direct contour plotting of g is not available. The mixture 
models we used are the maximum likelihood fits to the data set assuming 
unequal variance. Examining the ridgeline contour plot in Figure 6(a), we 
conclude that the three component fit actually corresponds to three dif- 
ferent modes, the modes being near the mean for each component, which 
corresponds to the three vertices in the simplex of Figure 6(a). 




Fig. 6. Ridgeline contour plot o^ 



if (a) Example 5 and (b) Example 6. 
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Example 6 (Egyptian skull data). This data consists of four measure- 
ments: Maximal breadth, Basibregmatic height, Basialveolar length and Nasal 
height of male Egyptian skulls from five different time periods (4000 BC, 
3300 BC, 1850 BC, 200 BC, 150 AD). Thirty skulls were measured from 
each time period [21]. 

Here we analyze the three earliest time periods and fit a three component 
normal mixture with unequal variances. Examining the ridgeline contour 
plot [Figure 6(b)], we observe that the three components, pertaining to the 
three time periods, produce a single mode. 

Remark 5. More expressive detail for the two-dimensional plots of this 
section could have been obtained by displaying the critical net of the density 
using, for example, the approximation techniques of Danovaro et al. [4] . This 
would show the maxima, saddlepoints and separatrices over the manifold 
region based on evaluating the elevation at a finite network of points. 

4. The Il-plot. Until this point we have focused on graphical techniques 
that are based on displaying the elevation of the density on the ridgeline. 
These techniques are quite elementary, and carry full information about 
the location and relative heights of the modes and saddlepoints. We now 
turn to a technique that focuses instead on the location of the modes and 
saddlepoints and not their elevations. By doing so, we can gain important 
insights into how the number and location of the modes depend on the values 
of vr for a given fixed set of component densities. For this section we will 
consider only the case K = 2. We will treat the component parameters as 
fixed throughout the analysis. 

Recall the ridgeline curve x*(a) for K = 2, defined in (3). If x*(a) is a 
critical value of h{a), then it satisfies 



where prime ""' denotes differentiation with respect to a. Also, recall that, 
for fixed component parameters, the value of a completely specifies the 
vector x*(a) independently of vr. For notational ease, any function of the 
D-dimensional vector x{a) will be written as the same function of the scalar 
a, for example, (pi{a) and (f)i{x{a)) are the same. Solving the last displayed 
equation for vr, and turning it into a function of a, we get 



It follows that if a corresponds to a critical value for the density g, then it 
solves the "pi-equation" 



h'{a) = 7r(/.i(x*(a))' + #(/>2(x*(a))' = 0, 




(5) 
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This gives us a recipe for finding the critical points of the density g, given 
a particular choice of vr: first, we solve for the set of a that satisfy the pi- 
equation (5), and then, from these a-solutions, calculate the critical points 
x*(a). If we have a unimodal density for a given vr, there must be one and 
only one solution to the pi-equation, which corresponds to the mode of the 
density. A bimodal density will have three vr-critical points, the first and the 
third (in the order of magnitude) corresponding to the two modes. 

To aid in finding solutions to the pi-equation, we describe n(a) on a G 
[0, 1]. We first claim that 

n(o)=o, n(i) = i, n(a)G[o,i]. 

To show this, first notice that (j)i{a) increases on the range a £ [0, 1) because 
(j)i{ct) > 0; also = 0. On the other hand, 4>2i(^) is decreasing on range 

a G (0, 1] as 02(a) < and (j)2{0) = 0. This establishes the result. 

We can also derive the following simple calculation formula for n(Q): 

1 _ (t)2{a)-4>i{a) ^ a Mc^) 
n(a) 4'2i'^) a(p2io!)^ 

which can be verified by routine calculus. 

As an example, let us examine the Il-plot of the two component bivari- 
ate normal mixture with three modes given in Example 1. As the mixing 
proportion in Example 1 is vr = 0.5, we would draw a horizontal line across 
the {a, 11(a)) plot (Figure 7) at height vr = 0.5. This line crosses the curve 
five times at (a = O(approx.), 0.004, 0.5, 0.996, l(approx.)). Among these, 
a = O(approx.), 0.5, l(approx.) correspond to the three modes, as was veri- 
fied by the ridgeline elevation plot (see Figure 2). 




s 



' I 1 1 T 1 1 

0J> l>f a* M CL« 1.D 

a 

Fig. 7. Il-plot of Example 1. denotes mixing proportion 7r = 0.5. The ranges of tt 

for which the distribution has two and three modes are given by the light band and dark 
band, respectively. 
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The same Il-plot can also be used to determine the number of modes of 
g as TT varies but the component densities remain unchanged. For example, 
if the means and variances are unchanged in Example 1, there will be three 
modes if and only if 0.256 < vr < 0.744 (dark band in Figure 7). Moreover, if 
vr < 0.0225 or vr > 0.975 (the unshaded region), then we will have a unimodal 
density; for all other vr's, we will have a bimodal density (light band). 

Similarly, for Example 2 we can find the range of mixing proportions 
for which the distribution will have one, two, three or four modes by ex- 
amining the plots in Figures 8(a) and 8(b). There is a narrow range, vr € 
(0.49974,0.50026), for which the parameters generate a distribution with 
four modes. The number of modes is at least two if vr G (0.25,0.75) and at 
least three if vr G (0.489,0.511). 

When K > 2, Il-plots can be used to examine the modal structure of each 
pair of component densities, where the mixing proportions in the paired 



(a) 



a 



(b) 













'i 


















1 





a 

Fig. 8. Il-plot of Example 2. The range of n for which the mixing distribution has two 
modes is given by the light band; with three modes it is given by the medium band; with 
four modes it is given by the dark band. The first two bands are visible from sub-plot (a), 
whereas the two darker bands are more clearly visible in sub-plot (b), which is magnified 
/or ttG (0.499,0.501). 
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Fig. 9. H-plots of pairs of component densities in the three- component mixture densities 
of (i) Example 4, (ii) Example 5 and (iii) Example 6, for component pairs (a) {1,2}, 
(b) {2, 3}, (c) {3, 1}, respectively, along with the appropriate mixing proportion represented 
by the horizontal line in each plot. 

mixture is determined by the relative weight in the whole mixture. To il- 
lustrate, we re-examine (i) Example 4, (ii) Example 5 and (iii) Example 6. 
The ridgeline contour plots of the above examples were already presented in 
Section 3. Now we present the pairwise Il-plots of the three components fit 
for each dataset (Figure 9). 

For the pairwise Il-plots of Example 4 [Figure 9(i)], we observe that, 
while comparing pairs {1,2} and {2,3}, the horizontal line crosses the 11- 
curve five times, implying the presence of three modes. Similarly, examin- 
ing the pairwise plots for the iris data [Figure 9(ii)], we see that each of 
the pairwise components is bimodal. Moreover, the plot tells us that the 
three components are well separated from each other for almost every set 
of mixing proportions. But, looking at the plots for the Egyptian skull data 
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[Figure 9(iii)], we can easily note that all the plots imply that the pairwise 
components exhibit a single mode. 

5. Analytic tools for detecting modality. 

5.1. The curvature function. We have seen that the 11 function deter- 
mines the modality structure of the mixture model as it depends on vr. Our 
next step is to look more deeply into the properties of this function. Now 
if n is strictly increasing in a, there will be exactly one solution to the pi- 
equation, and so the mixture densities are unimodal across all values of vr. 
Multiple solutions in a of the pi-equation can only arise if n(a) oscillates 
up and down as a increases. In particular, a careful analysis shows that 
n is increasing at a mode of h{a), and decreasing where h{a) has a local 
minimum. Therefore, the number and location of the n(a) critical points 
are informative about the number and location of the modes of the mixture 
density. 

Referring back to Section 4, the number of up-down oscillations of 11 is 
determined by the zeroes of 



In general, to determine the sign changes of 11', we can use any function 
of a with the same numerator 02(a)0'i(a) — (Pi {0^)4' 2(^)1 provided the de- 
nominator is a positive function of a. In this paper we will use the curvature 
function K{a) defined by 



We use k(q) as it results in a simple expression for any distribution belonging 
to the exponential family. We will call it the curvature function because its 
zeroes occur at the zeroes in curvature of the curve given by {(j)i{a),(l)2{a)). 
It is closely related to the mixture curvature measures given by Lindsay [12]. 

5.2. Properties of the curvature function K{a). We now study the role 
of the curvature function K{a) more closely. 

Based on our description of 11, it is clear that, at the first zero, ai, of k, the 
function 11 has a maximum, at the next 02 a minimum, and so forth. Thus, 
if we calculate the values of n(Qj) = ttj, then we can determine from them 
the ranges of vr in which we will have modes and antimodes. If the function 
K{a) does not change its sign in the range a £ [0, 1] , then the density is 
unimodal for all tt. If K{a) has exactly two zeroes at ai and 02, then, for vr 
between n(a2) and n(ai), the mixture will have two modes, and otherwise 
there will be just one; and so forth. 
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In our next result, we develop a simple expression for the curvature. Recall 
that we defined Sa = aT,^^ + aS^^. 

Theorem 3. Let g{x.) be the mixture of two multivariate normal densi- 
ties. Then 

(9) K{a) = \p{a)]'^[l — aap{a)], 

where p{a) = (/^a - tJ'iy^i^S^^^2^S~'^^2^S-^^^^{fJ-2 - /^i)- 
Proof. Given in the Appendix. □ 

Now, using the above theorem, since p{ct) is always positive, it is clear 
that the zeroes of k(q) are the same as the zeroes of (1 — aiyp^a)). For 
notational ease, let us denote 

q{a) = 1 — aap{a). 

By calculation, q{0) = q{l) = 1 and, hence, k takes positive values at the 
two extremes a = and 1. Thus, there are an even number of sign changes 
of the function K{a) in the range [0, 1], as also indicated by the nature of IT. 

In some special cases one can analytically describe the zeroes of n{a). In 
the corollaries which follow we show that our curvature result can be used 
to duplicate some of the univariate results found in the literature, and we 
extend them to certain multivariate situations. 

Corollary 3. In the equal variance case (Y^i = reduces to a 

quadratic expression 

q{a) = 1 - a(l - a){n2 - /Xi)'S"^(/i2 - fJ-i). 

q{a) and, hence, K{a) have two roots iff (7x2 — /X]^)'i;^^(/i2 — /^i) > 4, in 
which case the mixture will be bimodal if and only if tt £ {■ki,tt2), where 

J_ _ -j^ _^ Oi ^i(ai) 

and the ai are the two solutions in [0, 1] of q{a) = 0. 
Proof. Straightforward calculation. □ 

The above corollary is the generalization of the conditions for bimodality 
given for the univariate equal variance normal mixture density in [1, 5, 6]. We 
can also replicate, in the multivariate case, the unequal variance results for 
univariate densities, provided we make a proportional variance assumption. 
For this we need the following lemma. 
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Lemma 1. For the proportional variance case (Y,2 = a^T.i), the zeroes 
of K,{a) in the range [0, 1] are the same as the zeroes in [0, 1] of the following 
cubic in a: 

(10) qi{a) = (cj^(l -a)+ of - a{l - a)fi^a'^, 

where ^j? = - /Xi)'Sj"^(/i2 - /^i). 

Proof. See the Appendix. □ 

With the machinery we have now developed, we can offer a simpUfied proof 
of the results of Robertson and Fryer [20], while extending those results to 
higher dimensions. 

Corollary 4. Let g he the mixture of two normal densities with means 
fii and fj,2 variances Si and T,2 = cr^Si. 

(a) The density of f is unimodal for any mixing proportion vr if 

,1 , ^ 2(1 - + a^f' - (2a6 - 3a^ - 3a' + 2) 

(b) // the parameters do not satisfy the above condition, f is bimodal if 
and only if tt £ (711,772), where 

7Ti ai 4>2{ai) 

and the ai are the two solutions in [0, 1] of 

(fj^(l - a) + af - a{l - a)a'{fi2 - tJ.iy^i^ifJ.2 - /^i) = 0. 
Proof. See the Appendix. □ 

6. Conclusion. In this paper we have developed some powerful tools for 
understanding the topography of a multivariate normal mixture model. The 
tools are especially powerful in the case K = 2, where we can reduce our 
problem from D dimensions down to one. In any problem one can produce 
simple plots to investigate the key features of the density. In certain cases 
we can even describe analytically the number of modes and their locations. 

In the process of doing this analysis, we have not discussed how these 
new results might be used for statistical purposes. We think the possibilities 
are rich. For example, consider the clustering problem. If we fit a mixture 
of normals to high-dimensional data, we can associate the components with 
clusters of data [14]. However, we might also be interested in the information 
about how well separated two clusters are. A ridgeline elevation plot of their 
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estimated densities will show if they are close enough to each other to form a 
single mode, in which case we are unlikely to think of them as well separated. 

In a model with many components, one might define two components 
to be linked if they together form a single mode, and then use a map of 
the linkages to identify how the clusters are associated with each other. 
(We could also link them if the "mountain passes" are high relative to the 
peaks.) This could also lead one to a more compact description of the data 
structure through the construction of supercomponents consisting of linked 
components, then describing the model as a mixture of a smaller number of 
sup er comp onent s . 

We also note that there are still a number of open mathematical questions. 
For example, can we find the zeroes of the curvature function k analytically 
in any important special case other than the ones that are given here? Of 
this we are doubtful. However, finding exact formulae does not seem so 
important, as we believe it is possible to produce an elementary numerical 
algorithm to find these points. 

In this paper we have reduced the dimension of the mixture modality 
problem to min(i^r — 1,D). It was pointed out by a referee that this is exactly 
the same dimension reduction as occurs in discriminant analysis [2, 7, 8, 9, 
19], where one wishes to discriminate between K populations on the basis of 
D measurements. We think this is an important insight, and that, by using 
discriminant functions, it will be possible to link our mixture modality study 
with the area of discriminant analysis in a way that is mutually beneficial. 

Through a number of examples in this paper, we have shown that the to- 
pography of a multivariate mixture is not like the univariate, as the number 
of modes can be significantly more than the number of components. As a 
second question, one might ask if there exists an upper bound for the num- 
ber of modes, one that can be described as a function of K, the number of 
components, and D, the dimension of the multivariate mixture. 

Finally, our results are most effective for K = 2. It would therefore be 
useful to establish relationships between the modality structure of the pairs 
of densities in a mixture and the overall modality of the entire mixture. 

Note. Datasets and their parameter estimates used in this paper are 
available for download at www.stat.psu.edu/~surajit/topography/. 



A.l. Proof of Theorem 1. Suppose that Vg{-x*) = 0, so x* is a critical 
point. Then we have 



APPENDIX: PROOFS 



(A.l) 




V'kk4>k{^*) 



V0/r(x*) 
<Ai^(x*) ■ 
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If we let 

(A. 2) Ctj = -. r ) 

7ri0i(x*) +7r2<?;>2(x*) H h vri^-^i^ (x*) 

then, obviously, < < 1 and Y^f=i = 1. Further, note that 



(x*;/x,S) 



S-^(x*-/x). 



Thus, we have, from equation (A.l), that for every critical value x* there 
exists an cx such that 

(A.3) aiSr^(x* - /ii) + asS^^x* - /X2) + • • • + airS^^(x* - ^Ik) = 0. 
Solving this equation for x* gives the theorem. 

A. 2. Proof of Theorem 2. First, we note the geometric importance of 
the vectors Vj. We know that the point x(q;) lies in one of the elliptical 
contours of the density (pj. At this point the gradient in x of the density 
4>ji^) is proportional to Vj, and so Vj is orthogonal to the contour. Thus, 
if we were to start at x(a) and travel in any direction w orthogonal to vj, 
our path is in the support hyperplane to the elliptically shaped upper set 
{x:(j)j(x) > (j)j{x.*{a.))}. Using the fact that the ellipse is convex, our path 
lies outside the ellipse, and so in the set {x:i^j(x) < (f)j{x{a))}, except for 
equality at x = x*{a). That is, the point x.*{a) is a local maximum to (/>j(x) 
along any path orthogonal to Vj. 

Now, suppose that w satisfies the assumptions of the theorem. It follows 
from the form of dj that 

(A.4) w'(v,-v/<) = 0. 

However, due to the definition of x(a) we also have J2 ^^j^j — 0) so 
X]aj(vj — vk) = — vx- Putting this together with (A.4) shows that w'vk = 
0, and so w'vj = for j = 1,. . . ,K. That is, w is orthogonal to every v^, 
and so, by the first paragraph every component of the mixture density g{x) 
is locally maximized along the given line {x(q;) + 6w : 5 G K} at 6 = 0, and, 
hence, so is g{x). 

A.3. Proof of Theorem 3. For simplicity of calculation, we will reuse the 
notation Vj introduced in Section 2.2.1. Also, let us denote the first and the 
second derivatives of a vector x with respect to the scalar a by x and x, 
respectively. We will also use the same notation for the first and the second 
derivatives of the likelihood function. 

The equation defining x*, which is 
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avi + av2 = 



a 



Vi = --V2. 

a 



can be written as 
(A.5) 
(A.6) 

Differentiating (A.5) w.r.t. a, we get 

(A.7) QS^^i*(Q) - vi + aSg ^i*(a) + V2 = 

=^ (aSj"-'^ + aS^-'^)x*(a) = vi - V2 
vi 



(A.8) 



a 



[using (A.6)]. 



Taking /j(x) =log(0j(x)) for j = 1,2, the curvature k(q), given by (8), can 
be calculated via 



It can be shown that 
h = -x*(a)vi, 



^ = lj and ^ = /^. + [/,f 



12 = -x*(a)vi, 

a 



li = — x*(a)vi — x*(a)vi, 'I2 = — x*(a)vi H — x*(q;)vi — x*(a)vi^. 



a 



a 



After simplification. 



x*(a)vi 



a 



X*(q;)vi 



-aa 



a 



Writing in terms of the original parameters. 



x*(a)vi 



{^l2- ^ll)'T.^^s^^T.2^s^^T.2^sJl:^^{n2- ^J'l) =p{a)- 



a 



Remark 6. Note that p{oi) function is symmetric in the component 
labels, as it should be, because one can show by direct calculation that 

Y^ — 1 Q— 1 V — ^ 

^1 '~>a ^2 — ^2 '-'a ^1 • 

A.4. Proof of Lemma 1. S2 = cr^Si ^ Sa = - a) + ^). Using 

this relation, 

, (/^2 - tJ'iy^l^^l^2'^^1^2^^l^i^{fJ-2 - Ml) 



q{a) = 1 — a(l — a) 
= 1 — a(l — a) 



((l-a) + a/cj2)3 

(/^2-/^i)'^rn/^2-/^i) 



a4((l-a) + a/a2)3 
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T n ^ ^^^^ 
= 1 — a{l — a) 



(cj2(l-a)+a)3 

1 



rgi(a). 



((j2(l-a)+a)3 

Since (o"2(l — a) + a)'^ is positive in [0, 1], the zeroes of q{a) and qi{a) are 
the same in that range. 

A. 5. Proof of Corollary 4. We first prove part (a). Using Lemma 1, we 
know that the zeroes of K,{a) are the same as the zeroes of the cubic = 
((t2(1 — a) + a)3 — a(l — a)^'^a'^. Now, (71(a) has more than two real zeroes 
iff the discriminant of qi{a) is nonnegative. This condition is equivalent to 

s(^) = - ^?{-Aa^ + Qa^ + - 4) - 27^2(^2 - 1)^ > 0, 

that is, 

(A.9) s(^) = + 2^2(^2 _ 2)(^2 ^ ;L)(2a2 - 1) - 27a'^{a'^ - if > 0. 

Now s(/i) is a quadratic in /_f2 and, among its two zeroes only one is positive. 
The positive zero is given by 

2 _ 2(1 - ^2 + ^4)3/2 _ ^2^6 _ 3^4 _ 3^2 ^ 2) 

Also note that s(0) < 0. Thus, s(^) is positive when /i2 > jsjq^ applying 
Theorem 3 and Lemma 1, in the proportional variance case we find that g 
is unimodal if 

(/^2-/^i)'5^rn/^2-/^i) 

(A.IO) 

2(1 - (j2 + a^f^ - {2a^ - Sa^ - 3^2 + 2) 



< 



ct2 



Note that this condition is exactly the condition Robertson and Fryer [20] 
derived for the univariate case. 

Now we prove part (b) . We have already noted that the change of curva- 
ture of the 11(0) occurs at the same place as the solutions to K{a) = 0, which 
in turn are the roots of the cubic equation qi{a) = for the proportional 
variance case. Let ai and 02 be the two roots of 

qi{a) = (cr^(l — a) + of — a{l — a)^^a^ = 

lying between and 1. Thus, the range of vr for which the density is bimodal 
can be obtained from the range of a for which k,[q) < 0, which is the inter- 
val (ai,Q;2). Using the relation in (6), the range of vr can be derived as being 
the open interval {iti,tt2), such that 

— = l + —— — - for z = 1,2. 
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